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Abstract 

Global sensitivity analysis of complex numerical models can be performed by calculat- 
ing variance-based importance measures of the input variables, such as the Sobol indices. 
However, these techniques, requiring a large number of model evaluations, are often un- 
acceptable for time expensive computer codes. A well known and widely used decision 
consists in replacing the computer code by a metamodel, predicting the model responses 
with a negligible computation time and rending straightforward the estimation of Sobol 
indices. In this paper, we discuss about the Gaussian process model which gives analytical 
expressions of Sobol indices. Two approaches are studied to compute the Sobol indices: 
the first based on the predictor of the Gaussian process model and the second based on 
the global stochastic process model. Comparisons between the two estimates, made on 
analytical examples, show the superiority of the second approach in terms of convergence 
and robustness. Moreover, the second approach allows to integrate the modeling error 
of the Gaussian process model by directly giving some confidence intervals on the Sobol 
indices. These techniques are finally applied to a real case of hydrogeological modeling. 

Keywords: Gaussian process, covariance, metamodel, sensitivity analysis, uncertainty, 
computer code. 
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1 INTRODUCTION 

Environmental risk assessment is often based on complex computer codes, simulating for 
instance an atmospheric or hydrogeological pollution transport. These computer mod- 
els calculate several output values (scalars or functions) which can depend on a high 
number of input parameters and physical variables. To provide guidance to a better un- 
derstanding of this kind of modeling and in order to reduce the response uncertainties 
most effectively, sensitivity measures of the input importance on the response variability 
can be useful (Saltelli et al. [24], 



Klcijnen 



12], Helton et al. (9(). However, the estimation 



of these measures (based on Monte-Carlo methods for example) requires a large number 
of model evaluations, which is unacceptable for time expensive computer codes. This kind 
of problem is of course not limited to environmental modeling and can be applied to any 
simulation system. 

To avoid the problem of huge calculation time in sensitivity analysis, it can be useful to 
replace the complex computer code by a mathematical approximation, called a response 
surface or a surrogate model or also a metamodel. The response surface method (Box & 
Draper [2|) consists in constructing a function from few experiments, that simulates the 
behavior of the real phenomenon in the domain of influential parameters. These methods 
have been generalized to develop surrogates for costly computer codes (Sacks et al. 23[ , 



Kleijnen & Sargent [3]). Several metamodels are classically used: polynomials, splines, 
generalized linear models, or learning statistical models like_neural networks, regression 
trees, support vector machines (Chen et al. 



Our attention is focused on the Gaussian process model which can be viewed as an 



Fang et al. |8(). 
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23j). This 



181 ]. Cressie 



Sacks et al. 



extension of the kriging principles (Matheron 
metamodel which is characterized by its mean and covariance functions, presents several 
advantages: it is an exact interpolator and it is interpretable (not a black-box function). 
Moreover, numerous authors (for example, Currin et al. pfl, Santner et al. [251 ], Vazquez et 
al. 



Rasmussen & Williams 22]) have shown how this model can provide a statistical 
basis for computing an efficient predictor of code response. In addition to its efficiency, this 
model gives an analytical formula which is very useful for sensitivity analysis, especially for 



2J, Saltelli 



the variance-based importance measures, the so-called Sobol indices (Sobol 
et al. Ql). To derive analytical expression of Sobol indices, Chen et al. Q used tensor- 
product formulation and Oakley & O'Hagan [2fJ considered the Bayesian formalism of 
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Gaussian processes. 

We propose to compare these two analytical formulations of Sobol indices for the 
Gaussian process model: the first is obtained considering only the predictor, i.e. the 
mean of the Gaussian process model (Chen et al. 4f), while the second is obtained using 
all the global stochastic model (Oakley & O'Hagan 20(). In the last case, the estimate 
of a Sobol index is itself a random variable. Its standard deviation is available and we 
propose an original algorithm to estimate its distribution. Consequently, our method leads 
to build confidence intervals for the Sobol indices. To our knowledge, this information 
has not been proposed before and can be obtained thanks to the analytical formulation of 
the Gaussian process model error. This is particularly interesting in practice, when the 
predictive quality of the metamodel is not high (because of small learning sample size for 
example), and our confidence on Sobol index estimates via the metamodel is poor. 

The next section briefly explains the Gaussian process modeling and the Sobol indices 
defined in the two approaches (predictor-only and global model). In section 3, the numer- 
ical computation of a Sobol index is presented. In the case of the global stochastic model, 
a procedure is developed to simulate its distribution. Section 4 is devoted to applications 
on analytical functions. First, comparisons are made between the Sobol indices based 
on the predictor and those based on the global model. The pertinence of simulating all 
the distribution of Sobol indices is therefore evaluated. Finally, Sobol indices and their 
uncertainty are computed for a real data set coming from a hydrogeological transport 
model based on waterflow and diffusion dispersion equations. The last section provides 
some possible extensions and concluding remarks. 



2 SOBOL INDICES WITH GAUSSIAN PROCESS 
MODEL 

2.1 Gaussian process model 

Let us consider n realizations of a computer code. Each realization y(x) of the computer 
code output corresponds to a d-dimensional input vector x = (xi, ...,Xd)- The n input 
points corresponding to the code runs are called an experimental design and are denoted 
as X s = (kW, ...,£(")). The outputs will be denoted as Y s = (yW, ...,y {n) ) with = 
y(x^), i = 1, n. Gaussian process (Gp) modeling treats the deterministic response y(x) 
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as a realization of a random function Y(x), including a regression part and a centered 
stochastic process. The sample space f2 denotes the space of all possible outcomes ui, 
which is usually the Lebesgue-measurable set of real numbers. The Gp is defined on 
R d x and can be written as: 

Y{x,u) = f(x) + Z(x,u). (1) 

In the following, we use indifferently the terms Gp model and Gp metamodel. 

The deterministic function f(x) provides the mean approximation of the computer 
code. Our study is limited to the parametric case where the function / is a linear combi- 
nation of elementary functions. Under this assumption, f(x) can be written as follows: 

k 

3=0 

where (3 = \]3o, . . . , /3fc]' is the regression parameter vector, fj (j — I, . . . , k) are basis 
functions and F(x) — [fo(x), . . . , fk{x)] is the corresponding regression matrix. In the 
case of the one-degree polynomial regression, (d + 1) basis functions are used: 

fo(x) = 1, 

fi{x) = Xi for i = l,...,d. 

In our applications, we use this one-degree polynomial as the regression part in or- 
der to simplify all the analytical numerical computation of sensitivity indices. This can 
be extended to other bases of regression functions. Without prior information on the 
relationship between the output and the inputs, a basis of one-dimensional functions is 
recommended to simplify the computations in sensitivity analysis and to keep one of the 
most advantages of Gp model (Martin & Simpson [17||). 

The stochastic part Z(x,w) is a Gaussian centered process fully characterized by its 
covariance function: Covq(Z(x, u>), Z(u,lj)) = <r 2 B,(x,u), where a 2 denotes the variance 
of Z and R is the correlation function that provides interpolation and spatial correlation 
properties. To simplify, a stationary process (Z(x, to)) is considered, which means that 
the correlation between Z{x,oS) and Z(u,lu) is a function of the difference between x 
and u. Moreover, our study is restricted to a family of correlation functions that can be 
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written as a product of one-dimensional correlation functions: 

d 

Cov a (Z(x, w),Z{u, u)) = o 2 R(x u) = a 2 JJ R^xi - u,). (2) 

This form of correlation function is particularly well adapted to get some simplifications 
of the integrals in the future analytical developments: in the case of independent inputs, 
it implies the computation of only one or two-dimensional integrals to compute the Sobol 
indices. Indeed, as described in section T3. 21 the application and the computation of the 
Sobol index formulae are simplified when the correlation function has the form of a one- 
dimensional product (Santner et al. [2510 . 

n n 

Among other authors, Chiles & Delfiner [5j and Rasmussen & Williams [22J give a list 
of correlation functions with their advantages and drawbacks. Among all these functions, 
our attention is devoted to the generalized exponential correlation function: 

d 

Rq p {x -u) = JJexp(-0j|xi - ui\ Pl ) with 9 t > and <pi < 2, 

i=i 

where 6 = [8i, . . . , 6*^]* and p = [pi, . . . ,Pd]' are the correlation parameters. This choice is 
motivated by the derivation and regularity properties of this function. Moreover, within 
the range of covariance parameters values, a wide spectrum of shapes are possible: for 
example p — 1 gives the exponential correlation function while p — 2 gives the Gaussian 
correlation function. 

2.2 Joint and conditional distributions 

Under the hypothesis of a Gp model, the learning sample Y s follows a multivariate normal 
distribution pn(Y s \X S ): 

P n(Y s ,uj\X s )=^(F s f3,-E s ), 
where F s — [F^ 1 )) 4 , . . . , F(x^ n ' t )] is the regression matrix and 



ZJ — 1...TL 



is the covariance matrix. 

If a new point x* = {x\, x^) is considered, the joint probability distribution of 
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(Y s ,Y(x*,uj)) is: 

pn{Y s ,Y(x*,u)\X s ,x*,P,a,e,p) =M 
with 





F s 




( 











fe(jc*) 



fc(a;*)* 



(3) 



(4) 



k(x*) = (Covn(i/W,l r (x* > w)) J ... > Covn(i/W,K(x* ) w))) t 
= a 2 ( ^(aW, as*), . . . , Rq^{x^\x*) )*. 

By conditioning this joint distribution on the learning sample, we can readily obtain 
the conditional distribution of which is Gaussian (von Mises 



p Q (Y(x*,u)\Y s ,X s ,x*,(3,a,0,p) 

= Af(E Q [Y(x*,Lj)\Y s ,X s ,x*,p,a,e,p],YET Q [Y(x*,ij)\Y s ,X s ,x*,l3,a,e,p\), 



(5) 



with 



£^[r(x*,w)|y.,JC a ,x*,j9,£r,e,p] - ^(x^ + fe^S; 1 ^-^), (6) 
Var n [Y(x*,u;)\Y a ,X s ,x*,f3,a,0,p] = a 2 - fcfx*)^ 1 *^*). (7) 

The conditional mean of Eq. ([S]) is used as a predictor. The conditional variance 
formula of Eq. ([7]) corresponds to the mean squared error (MSE) of this predictor and 
is also known as the kriging variance. As we obtained the distribution for a new point 
conditionally to the learning sample, we can consider the covariance between two new 
sites. A Gp conditional to the learning sample is obtained and denoted as follows: 



{Y\Y s ,X s ,p,a,0,p)~Gp( E n [Y(x*,oj)\Y s ,X s ,f3,a,0,p], 

CoY Q (Y(x*,u),Y(u*,cj)\Y s ,X s ,x*,i3,a,9,p)) 

with the same expression for the conditional mean than Eq. ([5]) and 
Covo (Y(x*, w), Y(u*, lj)\Y s , X s , (3, a, 0, p) = a 2 (r q {x\u*) - fc^'E^Cu*)) • 



(8) 



(9) 



The conditional Gp model ((SJ provides an analytical formula which can be directly 
used for sensitivity analysis, and more precisely to compute the Sobol indices. To simplify 
the notations, the conditional Gp (Y \Y S , X s , (3, a, 6, p) will now be written in a simplified 
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form: Y Gp]Ys >Xg (A, ui). 



2.3 Sobol indices 

Methods based on variance decomposition aim at determining the part of the variance of 
the output Y(x) resulting from each variable Xi, i = 1, . . . , d. A global measure of the 



sensitivity of Y(x) to each input ccj is given by the first order Sobol index (Sobol 



Saltelli et al. 



29, 



VMx u ..,x d [Y] 

These indices have been defined for deterministic functions Y of the inputs X\ , . . . , Xd 
but, in the case of the conditional Gp model, we have a stochastic function of the inputs. 
A first solution is applying the Sobol index formula to the predictor, i.e. the mean of 
the conditional Gp (Eq. ([5])) which is a deterministic function of the inputs. Analytical 
calculations are developed by Chen et al. [4]. The second approach that we consider 
consists in using the whole global conditional Gp by taking into account not only the 
mean of conditional Gp model but also its covariance structure as Oakley & O'Hagan 2C| 



did. In this case, when the Sobol definition is applied to the global Gp model, a random 
variable is obtained and constitutes a new sensitivity measure. Its expectation can be 
then considered as a sensitivity index. Its variance and more generally its distribution 
can then be used as an indicator of sensitivity index accuracy. 
To sum up, the two approaches can be defined as follows: 

• Approach 1: Sobol indices computed with the predictor-only 

V a r Xi E Xu ... tXd [E a [Y G (X^)]\X t ] 
Si = — — = — — ; for i = 1, . . . , a. (10 

Vax Xl ,... iXd En[Y GplYsXa (X,u)} V ' 

• Approach 2: Sobol indices computed with the global Gp model 

5( , V^E Xl _ x AY G (X,uj)\XJ 

Si(uJ) = i\r nr m tor i=1,..., d. (11) 

Si(tJ) is then a random variable; its mean can be considered as a sensitivity index 
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PS. 



and its variance as an indicator of its accuracy: 

EnVar Xi E Xu ... !Xd [Y GplYa Xs {X,u)\Xi] 

\Y a ,X s 

Var n Var Xi E Xu ...,x d [ Y Gp\Y B ,x B ( X ^)\ X ' 
(E n V a r Xu ..., Xd {Y Gm3 . x jX,uW 



for i = 1, . 



for i = 1, . 



(12) 



Our work focuses on the computation and the study of the sensitivity indices defined 
following the two approaches, respectively Si and /Xg. . We will also propose a methodology 
to numerically simulate the probability distribution of Si- Then, a study to compare the 
accuracy and the robustness of the two indices is made on several test functions and the 
use of the distribution of Si is illustrated to build confidence intervals. 



3 IMPLEMENTATION OF SOBOL INDICES 

3.1 Estimation of Gp parameters 

First of all, to build the conditional Gp defined by Eq. (J5J), regression and correlation 
parameters (often called hyperparameters) have to be determined. Indeed, the Gp model 
is characterized by the regression parameter vector (3, the correlation parameters (0,p) 
and the variance parameter er 2 . The maximum likelihood method is commonly used to 
estimate these parameters from the learning sample (X s , Y s ). 

Several algorithms have been proposed in previous papers to numerically solve the 
maximization of likelihood. Welch at al. 3l| use the simplex search method and introduce 



a kind of forward selection algorithm in which correlation parameters are added step 
by step to increase the log-likelihood function. In Kennedy and O'Hagan's GEM-SA 
software (O'Hagan [2l|), which uses the Bayesian formalism, the posterior distribution 
of hyperparameters is maximized, using a conjugate gradient method (the Powel method 
is used as the numerical recipe). The DACE Matlab free toolbox (Lophaven et al. 
uses a powerful stochastic algorithm based on the Hooke & Jeeves method (Bazaraa et al. 
Q), which requires a starting point and some bounds to constrain the optimization. In 
complex applications, Welch's algorithm reveals some limitations and for complex model 
with high dimensional input, GEM-SA and DACE software cannot be applied directly on 
data including all the input variables. To solve this problem, we use a sequential version 
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(inspired by Welch's algorithm) of the DACE algorithm. It is based on the step by step 
inclusion of input variables (previously sorted). This methodology, described in details 
in Marrel et al. [15| . allows progressive parameter estimation by input variables selection 
both in the regression part and in the covariance function. 

3.2 Computation of Sobol indices for the two approaches 

To perform a variance-based sensitivity analysis for time consuming computer models, 
some authors propose t o ap proximate the computer code by a metamodel (neural networks 
in Martin & Simpson 16J, polynomials in Iooss et al. 10j, boosting regression trees in 
Volkova et al. [29(). For metamodels with sufficient prediction capabilities, the bias due 
to the use of the metamodel instead of the true model is negligible (Jacques ll|). The 



mctamodel's predictor have to be evaluated a large number of times to compute Sobol 
indices via Monte Carlo methods. Recent works based on polynomial chaos expansions 
(Sudret j^ ) have used the special form of this orthogonal functions expansion to derive 
analytical estimation of Sobol indices. However, the modeling error of this metamodel is 
not available and then has not been integrated inside the Sobol index estimates. 

The conditional Gp metamodel provides an analytic formula which can be easily used 
for sensitivity analysis in an analytical way. Moreover, in the case of independent inputs 
and with a covariance which is a product of one-dimensional covariances (Eq. ([!])), the 
analytical formulae of Si and fig. (respectively Eqs. (fTT))) and (fl2")) ) lead to numerical 
integrals, more precisely to respectively one-dimensional and two-dimensional integrals. 
The accuracy of these numerical integrations can be easily controlled and are less computer 
time expensive than Monte Carlo simulations. Few analytical developments jaf Sobol 
indices computation (for Si, fig. and er| ) can be found in Oakley & O'Hagan 



20] 



3.3 Simulation of the distribution of Si 

For the second approach where Si is a random variable, the distribution of Si is not 
directly available. By taking the mean related to all the inputs except Xj, the main effect 
of Xi is defined and denoted A(Xi,ui): 

A(Xi,cu) = E Xl ,...,x d [ Y GWs,x s ( x >")\ x i\- 
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A(Xi,oS) is still a Gaussian process defined on R x and characterized by its mean and 
covariance which can be determined in an analytical way by integrating the Gp model 
over all the inputs except Xi. In the case of independent inputs, one-dimensional integrals 
are obtained and can be numerically computed. Then, to obtain the Sobol indices, we 
consider the variance related to Xi of the Gaussian process defined by the centered main 
effect. This variance is written 

A(xi,oj) - J A(xi,uj)drj x ^j dr\ Xi 

with drj Xi the probability density function of the input Xi defined on [a, ; bi]. This last 
expression is a one-dimensional random integral which has to be discretized and approxi- 
mated by simulations. 

The discretization of this random integral over the space of Xi leads to a Gaussian 
vector of n dis elements: 

V n ^(u) = (A(a i ,u) i A(ai+ ( h ~ ai \ u>), A(a t + ( " dis ~ 1} (b t - ca),u), Afau))* 

\ ft'dis ^dis / 

The mean and covariance matrix of this vector are computed using those of the Gaussian 
process A[Xi,ui). The random vector V ndi is then multiplied by the matrix related 
to the numerical scheme used to compute the integral (rectangle or trapeze method, 
Simpson's formula ...). The Gaussian vector obtained from this multiplication is denoted 
V^.. s . To simulate it, we use the well known simulation method based on the Cholesky 
factorisation of the covariance matrix (Cressie [6|). We simulate a n d i S -size centered and 
reduced Gaussian vector and multiply it by the triangular matrix from the Cholesky 
decomposition. Then, an evaluation of the random integral which constitutes a realization 
of Si is computed from the simulation of the vector V n ,. . This operation is done A: sim 
times to obtain a probability distribution for Si ■ It can be noted, that only one Cholesky 
factorization of the covariance matrix of the n d i S -size vector is necessary, and used for all 
the k s im simulations of Si . To determine if the discretization number n dis and the number 
of simulations fc sim are sufficient, the convergence of the mean and variance of Si can be 
studied. Indeed, their values can be easily computed following their analytical expressions 

OH- 



IO 



4 APPLICATIONS 

4.1 Comparison of Si and fj,§. 

To compare and study the behavior of the two sensitivity indices Si and fig. , we consider 
several test functions where the true values of Sobol indices are known. Comparisons be- 
tween the two approaches are performed in terms of metamodel predictivity, i.e. relatively 
to the accuracy of the Gp model, constructed from a learning sample. This accuracy is 
represented by the predictivity coefficient Q 2 - It corresponds to the classical coefficient of 
determination R 2 for a test sample, i.e. for prediction residuals: 



1 ~ o 



Q 2 (Y, Y) = 1 — 

where Y denotes the n tcst true observations of the test set and Y is their empirical mean. 
Y represents the Gp model predicted values. To obtain different values of Q 2 , we simulate 
different learning samples with varying size n. For each size n, a Latin Hypercube Sample 
of the inputs is simulated (McKay et al. 13 ]) to give the matrix X s (n rows, d columns). 
Then, the test function is evaluated on the n data points to constitute (X S ,Y S ) and a 
conditional Gp model is built on each learning sample. For each Gp model built, the 
predictivity coefficient Q 2 is estimated on a new test sample of size 10000 and the two 
sensitivity indices Si and /Us. are computed. For each value of the learning sample size 
n, all this procedure, i.e. Gp modeling and estimation of sensitivity indices, is done 100 
times. Consequently, the empirical mean, 0.05-quantile and 0.95-quantile of Si and ^ig. 
are computed for same values of learning sample size n, and similar approximate values 
of Q 2 . 

4.2 Test on the g-function of Sobol 

First, an analytical function called the g-function of Sobol is used to compare the Sobol 
indices Si based on the predictor and the Sobol indices /x^. based on the global Gp model. 
The g-function of Sobol is defined for d inputs (X±, . . . , Xd) uniformly distributed on 
[0, if: 

.gsoboi(^i, ■ ■ - ,X d ) = TT g k {X k ) where g k {X k ) = ^ Xk 2 I + Q fc and flfe > 

L\ 1 + ak 
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Because of its complexity (considerable nonlinear and non-monotonic relationships) and 
to the availability of analytical sensitivity indices, it is a well known test example in the 
studies of global sensitivity analysis algorithms (Saltelli et al. The importance of 

each input Xk is represented by the coefficient a^. The lower this coefficient a^, the more 
significant the variable Xk- The theoretical values of first order Sobol indices are known: 

1 

c 3(l+a;) 2 „ . , 

1 = rT d — i for « = l, . . . , d. 

llfc=l 3(l+a fc ) 2 

For our analytical test, we choose d = 5 and — k for k = 1, . . . , 5.. 

Let us recall that we study only first order sensitivity indices. For each input Xj, the 
convergence of Si and fig in function of the predictivity coefficient Qi is illustrated in 
figure [TJ The convergence of sensitivity index estimates to their exact values in function 
of the metamodel predictivity is verified. In practical situations, a metamodel with a 
predictivity lower than 0.7 is often considered as a poor approximation of the computer 
code. TableQ]shows the connection between the learning sample size n and the predictivity 
coefficient Qi- As the simulation of a learning sample and its Gp modeling arc done 100 
times for each value of n, the mean and the standard deviation of Qi are indicated. 

[Table 1 about here.] 

Figure [1] also shows how the global Gp model outperforms the predictor-only model by 
showing smaller confidence intervals for the five sensitivity indices. 

[Figure 1 about here.] 

To sum up the convergence of the indices for the different inputs, it can be useful to 
consider the error between the theoretical values of Sobol indices Sl he ° and the estimated 
ones in Li norm: 

err i2 =^(5f eo -S t ) 2 (13) 
i=i 

where Si denotes the indices estimated with one of the two methods (Si — Si or Si ~ fig. ). 
Figure [5] illustrates this convergence in function of the learning sample size n and in 
function of the predictivity coefficient Qi. 

[Figure 2 about here.] 

From Figure [U we conclude that the sensitivity indices defined using the global Gp 
model (fig.) are better in mean than the one estimated with the predictor only (Si). This 
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difference between the two approaches is especialiy significant for high values of Sobol 
indices like the indices related to the first input (Si and )■ For lower indices, these 
two approaches give in mean the same results. Even if the two sensitivity indices seem 
to have the same rate of convergence in function of n or Q 2 , it is important to notice 
that the second approach is more robust. Indeed, fj,s. has a lower sampling deviation and 
variability than Si. Besides, this higher robustness is more significant when the accuracy 
of the metamodel is weak (Q2 < 0.8). So, taking into account the covariance structure of 
the Gp model appears useful to reduce the variability of the estimation of the sensitivity 
index. 



4.3 Test on Ishigami function 

We now consider another analytical function currently used in sensitivity studies (Saltclli 



et al. 



24j). the Ishigami function, where each of the three input random variables 



(Xi, X2, X3) follows a uniform probability distribution on [— tt, +tt]: 



f bi u S {X 1 ,X 2 ,X 3 ) = sinpd) + 7sin 2 (X 2 ) + 0.1X 3 4 sin(Xi) 



The theoretical values of first order Sobol indices are known: 

51 = 0.3139 

5 2 = 0.4424 

53 -0 

Like for the g-function of Sobol, the error with the theoretical values of Sobol indices 
in L 2 norm is computed for the two approaches for different learning sample size n and 
consequently for different values of Q 2 . As before (Eq. (fT"3")0 . the diagrams of convergence 
are shown in figure [3] 

[Figure 3 about here.] 

As observed for the g-function of Sobol, the indices defined with the global model are 
still more robust and less variable particularly for low values of Q 2 . However, the difference 
between the mean of the two indices is not significant. For high values of the Gp model 
accuracy (Q 2 > 0.8), the two approaches give the same values but the first one (with only 
the predictor) remains easier to compute. So, the use of the covariance structure through 
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the index Si seems to have a significant interest when the Gp metamodel is inaccurate or 
when few data are available to avoid too much variability of the estimated indices. 

4.4 Construction of confidence intervals for sensitivity indices 

As well as being more robust in mean, the index defined with the second approach Si 
has the advantage to have a variance easy to compute. More generally, it is possible to 
build a confidence interval of any level for this sensitivity index, using the methodology 
described in section 13.31 to simulate its distribution. This estimation of the uncertainty 
on the estimation of Sobol indices is another advantage of using the global Gp model: in 
practical cases with small learning sample size, only one Gp model is constructed. The 
predictivity coefficient Qi can be estimated by cross-validation or leave-one-out, and if Q2 
shows a low predictivity (typically less than 0.8), we wish to have some confidence in the 
estimation of Sobol indices computed from the Gp model. Contrary to Gp model, other 
metamodels do not allow to directly estimate the Sobol indices uncertainties due to the 
model uncertainties. 

To illustrate this, let us consider again the g-function of Sobol. Like in the previous 
section [4~2l we consider different sizes of the learning sample (from n — 20 to n = 50). For 
each value of n, we build a conditional Gp model and we control its accuracy estimating 
the Q2 on a test sample. We simulate the distribution of Si to obtain the empirical 0.05 
and 0.95-quantiles and consequently an empirical 90%-confidence interval. Then, we check 
if the theoretical values of Sobol indices belong to the empirical 90%-confidence interval. 
We repeat this procedure 100 times for each size n. Therefore, we are able to estimate the 
real level of our confidence interval and compare it to the 90% expected. The real levels 
obtained in mean for any size n and each input are presented in Table O 

[Table 2 about here.] 

For the high values of Sobol indices (Si and S2 for example), the observed levels of 
the 90%-confidence interval built from the simulation of the distribution of Si are really 
satisfactory and close to the expected level. In this case, the use of the global Gp model 
which gives confidence intervals for Sobol indices has a significant interest. On the other 
hand, for very low indices (close to zero), the Gp metamodel overestimates the Sobol 
indices. It explains the inaccuracy of the confidence interval. Indeed, without a procedure 
of inputs selection, each variable appears in the Gp metamodel and in its covariance. 
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Taking into account the variance leads to give a minimal bound for the influence of all 
the variables and consequently to overestimate the lowest Sobol indices. This tendency 
is confirmed by the observation of the mean of /ig. estimated for the three last inputs in 
Tabled 

We can make the same study with the Ishigami function for n = 30 to n = 130 
which induces a Q2 varying from 0.5 to 0.95. As all the procedure (i.e. learning sample 
simulation, Gp modeling and sensitivity analysis) is repeated 100 times for each size n, the 
convergence of the observed level of the empirical 90%-confidence interval can be observed 
in function of n. Similarly, we can study this convergence in function of Q2. Figured] 
shows all these diagrams of convergence. 

[Figure 4 about here.] 

As previously remarked on the g-function of Sobol, the 90%-confidence intervals are ef- 
ficient for the high values of Sobol indices (Si and S2 for example). For these indices, 
the observed level of the confidence interval converges to theoretical level 0.9. We can 
also notice that the predictivity quality of the Gp modeling which is required to obtain 
accurate confidence interval corresponds approximately to Q2 > 0.80. However, we judge 
that for Q2 > 0.6, the error is not too strong and the obtained 90%-confidence interval 
can be considered as a reliable and useful information. On the other hand, for very low 
indices (close to zero) , the problem of overestimating the Sobol indices still damages the 
accuracy of the interval confidence for any size n and any Q2- This remark is particularly 
true when the index is equal to zero (for example 53). 



4.5 Application on an hydrogeologic transport code 

The two approaches to compute the Sobol indices are now applied to the data obtained 
from the modeling of strontium 90 (noted 90 Sr) transport in saturated porous media 
using the MARTHE software (developed by BRGM, France). The MARTHE computer 
code models flow and transport equations in three-dimensional porous formations. In the 
context of an environmental impact study, the MARTHE computer code has been applied 
to the model of 90 Sr transport in saturated media for a radwaste temporary storage site 



in Russia (Volkova et al. [29j). One of the final purposes is to determine the short-term 
evolution of 90 Sr transport in soils in order to help the rehabilitation decision making. 
Only a partial characterization of the site has been made and, consequently, values of the 
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model input parameters are not known precisely. One of the first goals is to identify the 
most influential parameters of the computer code in order to improve the characterization 
of the site in a judicious way. To realize this global sensitivity analysis and because of 
large computing time of the MARTHE code, a Gp metamodel is built on the basis of a 
first learning sample. 

4.5.1 Data presentation 

The 20 uncertain model parameters are permeability of different geological layers compos- 
ing the simulated field (parameters f to 7) , longitudinal dispersivity coefficients (parame- 
ters 8 to 10), transverse dispersivity coefficients (parameters 11 to 13), sorption coefficients 
(parameters 14 to 16), porosity (parameter 17) and meteoric water infiltration intensities 
(parameters 18 to 20). To study sensitivity of the MARTHE code to these parameters, 
300 simulations of these 20 parameters have been made by the LHS method. For each 
simulated set of parameters, MARTHE code computes transport equations of 90 Sr and 
predicts the evolution of 90 Sr concentration for year 2010. For each simulation, the output 
we consider is the value of 90 Sr concentration, predicted for year 2010, in a piezometer 
located on the waste repository site. 

4.5.2 Gp modeling and computation of Sobol indices 

To model the concentration in the piezometer predicted by MARTHE code in 2010 in func- 
tion of the 20 input parameters, we fit a Gp metamodel conditionally to 300 simulations 
of the code. The regression and correlation parameters of the Gp model are estimated 
by maximum likelihood and a procedure of input selection is used. The input variables 
introduced in the metamodel are the sorption coefficient of the upper layer (parameter 14 
denoted kdl), an infiltration intensity (parameter 20 denoted i3) and the permeability of 
the upper layer (parameter 1 denoted perl). The accuracy of the Gp model is checked 
with the estimation of Q2 by a cross validation on the learning sample. The predictivity 
coefficient estimated is: Q2 — 0.92. From previous study (Marrel et al. 15|), we have 



found that the linear regression gives a Q2 = 0.69 and the metamodel based on boosting 
of regression trees gives a Q2 = 0.83. From laboratory measures and bibliographical infor- 
mation, prior distributions have been determined for the inputs kdl, i3 and perl and are 
respectively a Weibull, a trapezoidal and a uniform distributions. The parameters of these 
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distributions has been estimated or fixed a priori. Then, using the global Gp model, the 
Sobol indices defined by fig are computed (Eq. (fT2"| ) as well as the standard deviation 
as. and the 90%-confidence interval associated (cf . methodology 13 . 3[) . The results are 
presented in Table [3J with the Sobol indices obtained with the predictor-only approach 
and with the boosting predictor. 

[Table 3 about here.] 

The use of Gp model gives a better predictivity than the boosting of regression trees 
(respectively Qi = 0.92 and Q2 — 0.83) and consequently a more accurate estimation of 
Sobol indices. Besides, the Sobol indices estimated with the boosting model do not even 
belong to the confidence intervals given by the Gp model. Even if the sensitivity indices 
based on the predictor only and the ones estimated with the whole Gp model are very 
close, the second approach has the advantage to give confidence intervals and consequently 
to have a more rigorous analysis. 

Without considering their interactions, the 3 inputs kdl, i3 and perl explained nearly 
90% of the total variance of the output and the most influent input is clearly kdl, followed 
by j3 and perl. So, kdl is the most important parameter to be characterized in order to 
reduce the variability of the concentration predicted by MARTHE code. Using the whole 
Gp model, we also have an indication of the accuracy of Sobol indices. The standard 
deviation of the indices are small and increase the confidence in the value estimated 
{UShdV ^Si3 an< ^ i)- Moreover, the very small overlap of the 90%-confidence interval 
of the 3 indices indicates that the order of influence of the inputs is well determined and 
strongly confirms the predominance of kdl. So, the confidence intervals and the standard 
deviation obtained with the whole Gp model give more confidence in the interpretation 
of Sobol indices. 

Taking into account the variability of the Gp model via its covariance structure gives 
more robustness to the results and their analysis. However, this increase of precision and 
confidence has a numerical cost. Indeed, the number of numerical integrals being com- 
puted is of order 0(dn 2 ) where d is the number of inputs and n the number of simulations, 
i.e. the learning sample size. The numerical cost depends also on the numerical preci- 
sion required for the approximation of the integrals. Moreover, a high precision is often 
essential to provide the robustness of the computation of Sobol indices, especially when 
the distribution of the inputs is narrow and far from the uniform distribution (like the 
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Weibull distribution of kdl). In this last case, it can be judicious to adapt the numerical 
scheme in order to increase the precision in the region of high density. 

5 CONCLUSION 

We have studied the Gaussian process metamodel to perform sensitivity analysis, by esti- 
mating Sobol indices, of complex computer codes. This metamodel is built conditionally 
to a learning sample, i.e. to n simulations of the computer code. The Gp model proposes 
an analytical formula which can be directly used to derive analytical expressions of Sobol 
indices. Indeed, in the case of independent inputs and with our choice of regression and 
covariance functions, the formula of Gp model leads to one and two-dimensional numeri- 
cal integrals, avoiding a large number of metamodel predictor evaluations in Monte Carlo 
methods. The use of Gp model instead of other metamodel is therefore highly efficient. 
Another advantage of Gp metamodel stands in using its covariance structure to compute 
Sobol indices and to build associated confidence intervals, by using the global stochastic 
model including its covariance. 

On analytical functions, the behavior and convergence of the Sobol index estimates 
were studied in function of the learning sample size n and the predictivity of the Gp 
metamodel. This analysis reveals the significant interest of the global stochastic model 
approach when the Gp metamodel is inaccurate or when few data are available. Indeed, 
the use of the covariance structure gives sensitivity indices which are more robust and 
less variable. Moreover, all the distribution of the sensitivity index (defined as a random 
variable) can be simulated following an original algorithm. Confidence intervals of any 
level for the Sobol index can then be built. In our tests, the observed level of the interval 
was compared to the expected one on analytical functions. For the highest values of Sobol 
indices and under the hypothesis of a Gp metamodel with a predictivity coefficient larger 
than 60%, the confidence intervals are satisfactory. In this case, the use of the global Gp 
model which gives confidence intervals for Sobol indices has a significant interest. The 
only drawback is that the use of covariance structure has a tendency to give a minimal 
bound for the influence of all the variables and consequently to overestimate the lowest 
Sobol indices and to give inaccurate confidence intervals for very low indices (close to 
zero) . 

The use of covariance structure was also illustrated on real data, obtained from a 
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complex hydrogcological computer code, simulating radionuclide groundwater transport. 
This application confirmed the interest of the second approach and the advantage of Gp 
metamodel which, unlike other efficient metamodels (neural networks, regression trees, 
polynomial chaos, ...), gives confidence intervals for the estimated sensitivity indices. 
The same approach based on the use of the global Gp metamodel can be used to make 
uncertainty propagation studies and to estimate the distribution of the computer code 
output in function of the uncertainties on the inputs. 
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Figure 1: Convergence of sensitivity indices in function of the predictivity coefficient Q2 
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Figure 2: Error in L2 norm for sensitivity indices in function of n and Q2 (g-Sobol function). 



24 



Mean, 0.05-quantile and 0.95-quantile 



Predictor only 

Global Gp model 



25 30 35 40 45 50 55 60 

Learning sample size n 



70 75 



1.4 



1.2 



0.8 



O 

W 



0.6 



0.4 



0.2 



Mean, 0.05-quantile and 0.95-quantile 



Predictor only 

Global Gp model 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 



Figure 3: Error in L2 norm for sensitivity indices in function of n and Q2 (Ishigami function). 
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Table 1: Connection between the learning sample size n and the predictivity coefficient Q2 
(g-Sobol function). 
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Variable Theoretical value Mean of fig. Observed level of the empirical 





of Sobol index 




confidence interval 




0.7164 


0.7341 


0.9381 


x 2 


0.1791 


0.1574 


0.9369 


x 3 


0.0237 


0.0242 


0.5830 


x 4 


0.0072 


0.0156 


0.8886 


x 5 


0.0001 


0.0160 


0.0674 



Table 2: Real observed level of the empirical 90%-confidence interval built with the Gp model 
for the Sobol index of each input parameter (g-Sobol function). 
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input variable 


Boosting of 


Predictor only 




Whole Gp model 




regression trees 


(Gp model) 










Si 


Si 




a s, ! 


)0%-confidence interval 


perl 


0.03 


0.081 


0.078 


0.020 


[ 0.046 ; 0.112 ] 


kdl 


0.90 


0.756 


0.687 


0.081 


[ 0.562 ; 0.825 ] 


i3 


0.03 


0.148 


0.132 


0.022 


[ 0.100 ; 0.170 ] 



Table 3: Estimated Sobol indices, associated standard deviation and confidence intervals for 
MARTHE data. 
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